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' Covariance estimation becomes challenging in the regime where the number p of variables outstrips 

the number n of samples available to construct the estimate. One way to circumvent this problem is 
to assume that the covariance matrix is nearly sparse and to focus on estimating only the significant 
0^ ' entries. To analyze this approach, Levina and Vershynin (2011) introduce a formalism called masked 

■ covariance estimation, where each entry of the sample covariance estimator is reweighted to reflect an a 
priori assessment of its importance. 

This paper provides a short analysis of the masked sample covariance estimator by means of a matrix 
concentration inequality. The main result applies to general distributions with at least four moments. 

■ Specialized to the case of a Gaussian distribution, the theory offers qualitative improvements over ear- 
' lier work. For example, the new results show that n = 0{Blog^ p) samples suffice to estimate a banded 

covariance matrix with bandwidth 6 up to a relative spectral-norm error, in contrast to the sample com- 
plexity n = 0(i?log^ p) obtained by Levina and Vershynin. 

Keywords: Covariance estimation, matrix concentration inequality, matrix Khintchine inequality, matrix 
Rosenthal inequality, random matrix, Schur product. 
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1. Introduction 

A fundamental problem in multivariate statistics is to obtain an accurate estimate of the covariance 
matrix of a multivariate distribution given independent samples from the distribution. This challenge 
arises whenever we need to understand the spread of the data and its marginals, for example, when we 
perform regression analysis [12] or principal component analysis [18]. 

In the classical setting where the number of samples exceeds the number of variables, the behavior 
of standard covariance estimators is well understood [17, 28, 29]. The random matrix literature also 
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contains a substantial amount of relevant work; we refer to the book [2] and the survey [39] for further 
information. 

Modern applications, in contrast, often involve a small number of samples and a large number of 
variables. The paucity of data makes it impossible to obtain an accurate estimate of a general covariance 
matrix. As a remedy, we must frame additional model assumptions and develop estimators that exploit 
this extra structure. Over the last few years, a number of papers, including [4, 5, 8, 11, 13, 33], have 
focused on the situation where the covariance matrix is sparse or nearly so. In this case, we imagine 
that we could limit our attention to the significant entries of the covariance matrix and thereby perform 
more accurate estimation with fewer samples. 

This paper studies a particular technique for the sparse covariance problem that we call the masked 
sample covariance estimator. This approach uses a mask matrix, constructed a priori, to specify the 
importance we place on each entry of the covariance matrix. By reweighting the sample covariance 
estimate using a mask, we can reduce the error that arises from imprecise estimates of covariances that 
are small or zero. The mask matrix formalism was introduced by Levina and Vershynin [24] to provide 
a unified treatment of some earlier methods for sparse covariance estimation; we refer to their paper for 
a more detailed discussion of prior work. 

This paper provides a new analysis of the masked sample covariance estimator using some recent 
ideas from random matrix theory [1, 22, 27, 30, 34, 35, 37-39]. These methods, collectively known 
as matrix concentration inequalities, are particularly well suited for studying a sum of independent 
random matrices. The results provide strong bounds on the moments and exponential moments of the 
spectral norm of the sum by harnessing information about the individual summands. Indeed, matrix 
concentration inequalities can be viewed as far-reaching extensions of the classical inequalities for a 
sum of scalar random variables [10]. As we demonstrate in this work, matrix concentration inequalities 
sometimes allow us to replace devilishly hard calculations with simple arithmetic. 

One of our main reasons for writing this paper is to show that matrix concentration inequalities 
can streamline the analysis of random matrices that arise in statistical applications. We believe that the 
simplicity of our arguments and the strength of our conclusions make a compelling case for the value of 
these methods. Indeed, we hope that matrix concentration inequalities will find a place in the toolkit of 
researchers working on multivariate problems in statistics. 

1 . 1 Classical Covariance Estimation 
Consider a random vector 



Let a^i , . . . a;„ be independent random vectors that follow the same distribution as x. For simplicity, we 
assume that the distribution is known to have zero mean: E a; = 0. The covariance matrix E is the px p 
matrix that tabulates the second-order statistics of the distribution: 



where * denotes the transpose operation. The classical estimator for the covariance matrix is the sample 
covariance matrix: 



(1.1) 




i =l ■ 



(1.2) 



The sample covariance matrix is an unbiased estimator of the covariance matrix: E i7„ = i7. 
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Given a tolerance e G (0, 1), we can study how many samples n are typically required to provide an 
estimate with relative error e in the spectral norm: 

E||i;„-i;|| s$e||i:||. (i.3) 

This type of spectral-norm error bound is quite powerful. It limits the magnitude of the estimation error 
for each entry of the covariance matrix; it provides information about the variance of each marginal of 
the distribution of a;; it even controls the error in estimating the eigenvalues of the covariance using the 
eigenvalues of the sample covariance. 

Unfortunately, the error bound (1.3) for the sample covariance estimator demands a lot of samples. 
Indeed, suppose that the covariance matrix has full rank. When n < p, the sample covariance is rank- 
deficient, so the spectral norm error is bounded away from zero ! 

Typical positive results state that the sample covariance estimator is precise when the number of 
samples is proportional to the number of variables, provided that the distribution decays fast enough. 
For example, assuming that x follows a normal distribution, 

n^Ce'^p =^ 1 1 X',, - i: 1 1 ^ £ 1 1 i: j I with high probabiUty. (1.4) 

We use the convention that C denotes an absolute constant whose value may change from appearance to 
appearance. See [39, Thm. 57 et seq.] for details of obtaining the bound (1.4). 



1 .2 The Masked Sample Covariance Estimator 

In the regime n <C p, where we have very few samples, we cannot hope to achieve an estimate like (1.3) 
for a general covariance matrix. Instead, we must instate additional assumptions and incorporate this 
prior information to construct a regularized estimator Over the last few years, researchers have studied 
the case where the covariance matrix is sparse or nearly sparse. In this setting, we can often refine our 
estimation procedure by focusing on the most significant entries of the covariance matrix. 

One way to formalize this idea is to construct a symmetric px p matrix M with real entries, which 
we call the mask matrix. In the simplest case, the mask matrix has 0-1 values that indicate which entries 
of the covariance we attend to. A unit entry m^j = 1 means that we estimate the interaction between the 
/th and jth variables, while a zero entry mij = means that we abdicate from making the estimate. More 
generally, we can allow the components of the mask to range over the interval [0, 1], in which case the 
size of mij is proportional to the importance of estimating the (ij) entry of the covariance matrix. 

Given a mask M, we define the masked sample covariance estimator M I!„, where the symbol 
denotes the componentwise (i.e., Schur or Hadamard) product. The following expression bounds the 
root-mean-square spectral-norm error that this estimator incurs. 



E||M0i;„-i;||^ 



1/2 



EWMQSn-MQUl 



1 2' 



1/2 



\MQl!-E\\. (1.5) 



bias 



The second term in (1.5) represents the bias in the estimate owing to the presence of the mask, while the 
first term measures how much the estimator fluctuates about its mean value. This bound is analogous 
with the classical bias-variance decomposition for the mean-squared-error (MSB) of a point estimator. 

To obtain an effective estimator, we must design a mask that controls both the bias and the variance 
in (1.5). We cannot neglect too many components of the covariance matrix, or else the bias in the 
masked estimator may compromise its accuracy. At the same time, each additional component we 
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estimate contributes to the size of the variance term. In the case where the covariance matrix is sparse, it 
is natural to strike a balance between these two effects by refusing to estimate entries of the covariance 
that we know a priori to be small or zero. 

Many of the regularization techniques for sparse covariance estimation studied in the literature, 
such as [5, 8, 13], can be described using mask matrices. These works focus on specific cases, such as 
banded masks and tapered masks, whereas we have followed Levina and Vershynin [24] by allowing an 
arbitrary symmetric mask M . We refer to the papers cited in this paragraph for further background and 
references. 

Remark 1 . 1 (Adapted Masks) In statistical practice, it may be more natural to estimate the mask matrix 
M from the observed samples, rather than to construct the mask a priori. A number of authors, includ- 
ing El Karoui [11], have studied the performance of covariance estimators with an adaptive threshold. 
In this work, we focus on the simpler case where the mask is fixed. It presents an interesting challenge 
to analyze a data-dependent mask using matrix concentration inequalities. 



1.2.1 Example: The Banded Estimator of a Decaying Covariance Matrix. Let us consider the case 
where the entries of the covariance matrix S decay away from the diagonal. Suppose that, for a fixed 
parameter a > 1, 

|(i:),7| s; |/-; + ir" for each pair of indices. 

This type of property might hold for a random process whose correlations are localized in time. (That 
is, the current value of the process depends weakly on the past and the future.) Related covariance 
structures arise for random fields that have short spatial correlation scales. 

A simple (suboptimal) approach to this covariance estimation problem is to focus on a band of 
entries near the diagonal. Suppose that the bandwidth B :=2b^-\ for a nonnegative integer b. For 
instance, a mask with bandwidth B = 3 for an ensemble of = 5 variables takes the form 



band ■ 



1 

1 1 
1 1 1 
1 1 
1 



In this setting, it is straightforward to compute the bias term in (1.5). Indeed, 



|(M0i;- 




\i-.i\>b 
otherwise. 



Gershgorin's theorem [15, Sec. 6.1] implies that the spectral norm of a symmetric matrix is dominated 
by the maximum ii norm of a column, so 



lM0i;-i;|| <2^(/t+i)-" ^ 

k>b 



a-l 



{b + \) 



The second inequality follows when we compare the sum with an integral. A similar calculation shows 
that ^ 1 +2(a — 1)^^ Assuming the covariance matrix really does have constant spectral norm, 
it follows that 
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On a relative scale, the bias decreases polynomially as we increase the bandwidth B of the mask. Note 
that this estimate follows from an easy application of classical matrix analysis. 

We cannot complete the bound (1.5) for the estimation error without understanding the behavior of 
the fluctuation term. In contrast to the bias term, this analysis is challenging, and it requires an excursion 
into the field of random matrix theory. 

1 .3 The Performance of Masked Covariance Estimation 

This paper studies the variance term in the error bound (1.5) for the masked sample covariance estimator 
To perform this analysis, we must address a variety of issues; How does the structure of the mask M 
affect the performance of the estimator? How many samples n do we need to control the size of the 
fluctuation? What role does the distribution of the underlying random vector x play? 

In Section 3, we use a matrix concentration inequality to obtain a bound for the variance term 
in (1.5) that holds for any distribution on x with four finite moments. In the Introduction, we focus 
on the simpler setting where the random vector follows a normal distribution. Our theory for this case 
highlights the factors that affect the performance of the estimator We examine how the structure of the 
mask enters into the error bound, and we describe how the error decreases with the number of samples. 
We also discuss how the correlation among variables affects the difficulty of the covariance estimate. 

Note that this work focuses on the random matrix aspects of the masked sample covariance estimator. 
As a consequence, we purposely avoid a detailed discussion of the statistical issues. In particular, we 
make no further study of the (deterministic) bias term in (1.5). Nor do we presume to make any claims 
about statistical practice. 

1.3.1 The Complexity of a Mask. The number of samples we need to control the variance in (1.5) 
depends on "how much" of the covariance matrix we are attempting to estimate. A "complex" mask 
requires us to estimate many interactions between variables, so we need a lot of samples to limit the 
fluctuation of the masked sample covariance estimator. In this section, think about masks that take 0-1 
values to gain intuition. 

Our analysis identifies two separate metrics that quantify the complexity of the mask. The matrix 
norm || H returns the maximum I2 norm of a column. The first complexity measure is the square of 
the maximum column norm: 

||M||2^2-max,[£.m2.]. 

Roughly, the bracket counts the number of interactions we want to estimate that involve the variable j, 
and the maximum computes a bound over all p variables. This metric is "local" in nature. The second 
complexity measure is the spectral norm || Af || of the mask matrix, which provides a more "global" view 
of the complexity of the interactions that we estimate. 

Some examples may illuminate how these metrics reflect the properties of the mask. First, suppose 
that we estimate the entire covariance matrix, so the mask is the matrix of ones: 

M = matrix of ones =^ \\M\\\^2=P 11-^11 =P- 

Next, consider the mask that arises from the banded estimator in Section 1.2.1: 

M = 0-1 matrix, bandwidth B =^ ||M||i^2^'8 and ||M|1^B 

because there are at most B ones in each row and column. When B <^ p, the banded mask asks us to 
estimate fewer interactions than the full mask, so we expect the covariance estimate to be easier 
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Remark 1.2 (Are the Two Metrics Really Different?) In the examples above, the two metrics take 
the same value, but this coincidence does not always occur Although the spectral norm dominates the 
maximum ^2 norm of a column, the square of the maximum column norm can be substantially larger or 
substantially smaller than the spectral norm. 

Remark 1.3 (Scaling) We can reduce the size of both complexity measures by rescaling the mask, but 
changing the scale of the mask may also increase the bias term in (1.5). When studying the masked 
sample covariance estimator in the context of a particular apphcation, it is essential to consider both the 
variance and the bias terms. 



1.3.2 Masked Covariance Estimation for Multivariate Normal Distributions. We are now prepared 
to present our main result for the case where the random vector x follows a normal distribution with 
zero mean. The statement involves the norm which returns the maximum absolute entry of a 

matrix. 

Theorem 1.1 (Masked Covariance Estimation for a Gaussian Distribution) Fix a p x p symmetric 
mask matrix M, where p ^ 3. Suppose that a; is a Gaussian random vector in M.P with mean zero. 
Define the covariance matrix S and the sample covariance matrix S„ as in (1.1) and (1.2). Then the 
variance of the masked sample covariance estimator satisfies 



E||M0i:„-M0i:| 



1/2 



\M\\U2^ogp 



\M\\logp -login p) 



(1.6) 



The proof of Theorem 1 . 1 appears in Section 3.4. The rest of this section consists of some discussion 
of the result, as well as comparisons with related work. 



1.3.3 Extension to Distributions with Nonzero Mean. In the actual practice of covariance estimation, 
we would center each sample empirically by subtracting the sample mean x = n^^Y,'i=i The sample 
covariance (1.2) is computed using the centered samples a;, = Xj — x instead of the original samples a;,. 
Theorem 1 . 1 can be extended to cover the masked covariance estimator formed with centered samples. 
See [24, Rem. 4] for the details of the argument. 



1 . 3 .4 Discussion and Interpretation. Theorem 1 . 1 exposes several phenomena in the behavior of the 
masked sample covariance estimator The first term on the right-hand side of (1.6) reflects the scale for 
moderate deviations of the estimator, and it depends on the "local" complexity of the mask. The second 
term on the right-hand side of (1.6) reflects the scale for large deviations of the estimator. It depends on 
the "global" complexity of the mask. When the sample size n is large, the first term drives the bound 
because the second term usually decays faster 

The ratio of the maximum entry of the covariance matrix to the spectral norm is an interesting feature 
of (1.6). The ratio never exceeds one, but it can be as small as p^'^ when the covariance matrix has rank 
one. We interpret this factor as saying that covariance estimation is easier when the variables are highly 
correlated. 
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1.3.5 Sample Complexity Bound. Markov's inequality can be used to convert (1.6) into an error 
bound that holds in probability, so Theorem 1.1 also allows us to develop conditions on the number 
« of samples that we need to control the size of the fluctuation. To obtain the sample complexity, 
assume that « ^ p, and select an error tolerance e G (0, 1). Then there is a constant €99% for which 

IWMWl^^logp , ||M||log2p 



with probability at least 99%. See the discussion after Theorem 3.1 for information about how to obtain 
sample complexity bounds that hold with higher probability. 




MQSn~MQS\\ ^e\\S\\ (1.7) 



1.3.6 Example: The Banded Covariance Estimator. Consider the banded covariance estimation prob- 
lem in Section 1.2.1, with the mask 

M = 0-1 matrix with bandwidth B. 

The sample complexity bound (1.7) and the norm calculations from Section 1.3.1 demonstrate that 



n^C 



Blogp Blog p 



(1.8) 



is sufficient to obtain a relative spectral-norm error e with constant probability. In particular, the con- 
dition n > Blog^ p always ensures a constant relative error in (1.6). It follows that, when B ^ p, the 
variance of the estimator can be small, even when the number of samples is much smaller than the total 
number of variables. 



1.3.7 Comparison with Bounds of Levina and Vershynin. The most natural point of comparison for 
Theorem 1.1 is the main theorem of Levina and Vershynin [24, Thm. 2.1]. Their result states that, for a 
centered normal random vector x, the fluctuation in the masked sample covariance estimator satisfies 



E\\MqS„-MqS\\ s^C 



The associated sample complexity bound is 



\M\\U2^og^P 



1/2 



|M||log3p 



n^C 



|M||?^2log'p , \\M\\log'p 



(1.9) 



Our sample complexity bound (1.7) has a structure similar to (1.9), but several improvements are worth 
mentioning. First, the ratio of norms is a new feature in our estimate (1.7). The second improvement 
over (1.9), which has less conceptual significance, is the reduction of the number of logarithmic factors. 
Finally, our main result, Theorem 3.1 covers all centered distributions with four finite moments. 



1.4 Organization of the paper 

The rest of the paper is organized as follows. Section 2 introduces our notation and some preliminaries. 
Section 3 presents the main result for zero-mean distributions with finite fourth moments, together with 
its proof and the proof of Theorem 1.1. In Section 4, we deal with the technical estimates at the heart of 
the main result. Appendix A establishes the matrix concentration inequality we require. 
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2. Preliminaries 

This section sets out the background material we need for the proof. Section 2.1 summarizes our nota- 
tional conventions, and Section 2.2 describes some basic properties of the Schur product. 

2.1 Notation and Conventions 

In this paper, we work exclusively with real numbers. Plain italic letters always refer to scalars. Bold 
italic lowercase letters, such as a, refer to column vectors. Bold italic uppercase letters, such as A, 
denote matrices. All matrices in this work are square; the dimensions are determined by context. We 
write for the zero matrix and I for the identity matrix. The matrix unit E,y has a unit entry in the {ij) 
position and zeros elsewhere. 

The symbol * denotes the transpose operation on vectors and matrices. We use the term self-adjoint 
to refer to a matrix that satisfies vl = ^* to avoid confusion with symmetric random variables. Curly 
inequalities refer to the positive-semidefinite partial ordering on self-adjoint matrices: A ^ B if and 
only if JB — A is positive semidefinite. 

The function diag( ) maps a vector a to a matrix whose diagonal entries correspond with the entries 
of a. When applied to a matrix, diag( ) zeroes out the off-diagonal entries. We write tr( ) for the trace. 
The symbol denotes the componentwise (i.e., Schur or Hadamard) product of two matrices. 

We write || • || for both the I2 vector norm and the associated operator norm, which is usually called 
the spectral norm. The symbol || -H refers to the Schatten ^-norm of a matrix: 



where \ A\ :~ The norm ||-||^ returns the maximum absolute entry of a vector, but we use a 

separate notation |i • linj^x ^'^^ '■^^ maximum absolute entry of a matrix. We also require the norm 



The notation reflects the fact that this is the natural norm for linear maps from £1 into £2- 

We reserve the symbol £, for a Rademacher random variable, which takes the two values ±1 with 
equal probability. We also assume that all random variables are sufficiently regular that we are justified 
in computing expectations, interchanging limits, and so forth. 

2.2 Facts about the Schur Product 

The proof depends on some basic properties of Schur products. The first result is a simple but useful 
algebraic identity. For each square matrix A and each conforming vector x. 



The second result states that the Schur product with a positive-semidefinite matrix is order preserving. 
That is, for a fixed positive-semidefinite matrix A, 



Mil,:=[tr|^n' 




AQxx* = diag(a;) Adiag(a;). 



(2.1) 



B\ ^ B2 impHes A Bi =<; A B2- 



(2.2) 



This property follows from Schur's theorem [16, Thm. 7.5.3], which states that the Schur product of 
two positive-semidefinite matrices remains positive semidefinite. 
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3. Masked Covariance Estimation 

In this section, we state and prove detailed error estimates for masked covariance estimation of a general 
distribution with finite fourth moments. Section 3.1 defines two concentration parameters that measure 
the spread of the distribution. We present the main theorem and a short discussion in Sections 3.2 
and 3.3. In Section 3.4, we show how to derive Theorem 1.1, the result for Gaussian distributions. The 
proof of the main result appears in Section 3.5. 



3 . 1 Concentration Parameters 

The effectiveness of the masked sample covariance estimator depends on the concentration properties 
of the distribution of x. Let us introduce two quantities that measure different facets of the variation of 
the random vector. 

For r ^ 1, the rtli diagonal moment pLr{x) of the distribution is defined to be the maximum Lr norm 
of a single component of the vector: 

M,-(a;):=max,•(E|X,■|'■)'/^ (3.1) 

In other words, jU,- gives us uniform control on the rth moment of each component of x. 

We also require some information about the spread of the distribution in all directions. Define the 
uniform fourth moment v{x) by the formula 




(3.2) 



The uniform fourth moment measures how much the worst marginal varies. 

Note that both pLr{x) and v{x) have the same homogeneity as the random vector x. (This property 
is sometimes expressed by saying that the quantities have the same dimension, the same units, or the 
same scaling.) As a consequence, the quantities ^r{x)v{x) and ll}{x) have the same homogeneity 
as the covariance matrix S. In the sequel, we abbreviate fir ^i-r{x) and V := v{x) whenever the 
distribution of the random vector x is clear. 



3.2 Main Result for Masked Covariance Estimation 

The following theorem provides detailed information about the variance of the error in the masked 
sample covariance estimator for a zero-mean distribution with finite fourth moments. 

Theorem 3.1 (The Masked Sample Covariance Estimator) Fix a p x p symmetric mask matrix M, 
where p ^ 3. Suppose that a; is a random vector in M'' with mean zero. Define the covariance matrix 
E and the sample covariance matrix S,, as in (1.1) and (1.2). Then the variance in the masked sample 
covariance estimator satisfies 



E\\M(Z)S„-MqS\ 



1/2 



8elog/7 8elog/7 



E max,- 1 1 CD; j 



1/2 



Furthermore, the expected maximum satisfies the bound 



E max; 1 1 a;,- 1 



1/2 



^inf M^/^'-.^l- 



(3.3) 



(3.4) 
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The diagonal moment ji^ and the uniform fourth moment v are defined in (3. 1) and (3.2). 

In Section 3.3, we offer a short discussion of this result. Afterward, in Section 3.4, we specialize 
the result to Gaussian distributions, which establishes Theorem 1 . 1 of the Introduction. The proof of 
Theorem 3.1 appears below in Section 3.5. 



3.3 Discussion 

Theorem 3.1 has a wider scope that most of the results in the literature on sparse co variance estimation. 
Indeed, we allow completely general masks, and the bound is valid for any distribution with finite 
fourth moments. When we specialize the result to the Gaussian case, we obtain an improvement over 
prior work [24, Thm. 2.1]. Even so, our argument, which is based on a matrix moment inequality, is 
very direct. 

For simplicity, we have presented Theorem 3.1 as a bound on the variance of the masked sample 
covariance estimator. A refinement of the same argument allows us to compute higher moments of the 
error, which in turn yield polynomial tail bounds via Markov's inequality. When the distribution of the 
random vector x is subgaussian, this method even yields exponential tail bounds. 

Remark 3.1 (Alternative Arguments) The proof of Theorem 3.1 is based on a new matrix moment 
inequality. We can obtain similar results using other matrix concentration inequalities that appear in the 
literature. In particular, the matrix Rosenthal inequality [27, Cor. 7.5] leads to a very similar bound. The 
initial version of this manuscript [9] uses the matrix Bernstein inequality [38] to develop a version of 
Theorem 3.1 for a subgaussian random vector x. 



3.4 Specialization to Gaussian Distributions 

It is natural to apply Theorem 3.1 to study the performance of masked covariance estimation for a 
zero-mean Gaussian random vector In this case, the covariance matrix determines the distribution com- 
pletely, so we can obtain a more transparent statement that does not involve the concentration parame- 
ters. Theorem 1.1 follows from these considerations. 

Proof of Theorem 1.1 from Theorem 3.1. First, we compute the (2r)th diagonal moment li2r{x) for 
r ^ 1 . Observe that the ;th component X, of the vector a; is a centered normal random variable with 
variance a,,-, where a,-, denotes the /th diagonal entry of S. Using the standard expression for the (2r)th 
moment of a normal random variable, we obtain 

mt^''^-a'n^r'--al. 

This bound is valid for each real number r ^ 1 . Therefore, taking the rth root, we reach 

^ r ■ max; aii = r\\S\\ . (3 .5) 

The identity holds because the maximum entry of a positive-definite matrix occurs on its diagonal. In 
particular, we see that 

li,^V2\\S\\ll^,. (3.6) 
Next, we instantiate the bound (3.4) on the expected maximum. Choose 4r — 2log{np) to reach 



imax; j|a;,-||^ 



«;eM2'iog(„rt '^elogM-liriU,,, (3-7) 
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owing to (3.5). 

Finally, we bound the uniform fourth moment v{x). Fix a unit vector u. The distribution of the 
marginal u*x is Gaussian with mean zero. To compute the variance of the marginal, we write 
X = where g is a standard Gaussian vector. Then 

al=E\u*x\^ = E\u*{S^/^g)\- = u* E^/^{Egg*)S^/^u = u* Su s$ . 

The fourth moment of a Gaussian variable equals three times its squared variance, so 

E\u*xt ^3al^3\\E\\^ . 

We conclude that the uniform fourth moment satisfies 

v{x)= sup (E|u*a;|'^)i/'*<;3'/'*||i:||'/^ (3.8) 

||ii||=i 

To complete the proof of Theorem 1.1, substitute the bounds (3.6), (3.7), and (3.8) into the inequal- 
ity (3 . 3) from Theorem 3.1. □ 

3.5 Proof of Theorem 3. 1 

The proof of Theorem 3.1 proceeds in several short steps. First, we write the variance of the estimator as 
a sum of independent random matrices, and we use symmetrization to simplify the expression. Second, 
we apply a matrix moment inequality to bound the variance of the estimator in terms of the spectral 
norm of a matrix variance and the maximum spectral norm of the summands. Finally, some short 
computations, which appear in Section 4, yield bounds for the remaining terms. 

3.5. 1 Symmetrization. We begin by assigning a name to the quantity of interest: 

E:=E\\MqS„~MqS\\. 

The random matrix inside the norm has a natural expression as a sum of independent, centered random 
matrices. To see why, substitute the definitions (1.1) and (1.2) of the population covariance matrix S 
and the sample covariance matrix X'„ to obtain 

E = --E WY", (M XiX* - EM Xix*)\\ . 
The standard symmetrization method [23, Lem. 6.3] yields the bound 

E^--E\\Y,li^i{MQx,x*)\\. (3.9) 

Here, {^,} is a sequence of independent Rademacher random variables that is also independent from the 
sequence {a;,} of samples. The advantage of the expression (3.9) is that each Schur product involves a 
rank-one matrix, which greatly simplifies our computations. 
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3.5.2 The Spectral Norm of an Independent Sum. The main technical tool in this paper is a bound 
for the second moment of the spectral norm of a sum of independent, symmetric random matrices. 

Theorem 3.2 (Matrix Second Moment Inequality) Assume that p ^ 1>. Consider a finite sequence 
{ Yi] of independent, symmetric, random, self-adjoint matrices with dimension p x p. Then 



1/2 



21 1/2 



-4elog/7 • 



E max; II Y,\ 



1/2 



Theorem 3.2 reduces the challenging problem of bounding the second moment of the spectral norm 
to two simpler calculations. We interpret the first term as the variance of a sum of independent, sym- 
metric random matrices. The second term measures the typical size of the largest summand. The result 
is new in the form that we present it, but it has strong precedents in the literature. See Appendix A for 
the proof and a discussion of related work. 

With Theorem 3.2 at hand, it is straightforward to bound (3.9). We reach 



£^-v/8elogp [Y^M{MQXiX*f] 



21 1/2 



8e log p 



8e log p 



\\R{MQxx*) 



2||i/2 8elog/9 



E max,- 1 1 Af XjX* 

1/2 



1/2 



Emax,- ||M0 a;/a;* 



n " " n 

The second line follows from the identical distribution of the summands. 



(3.10) 



3.5.3 The Matrix Variance and the Maximum Spectral Norm. All that remains is to calculate the 
matrix variance that appears in the first term of (3.10) and the expected maximum norm that appears in 
the second term. Lemma 4. 1 demonstrates that 

V.{MQxx*f ^^lv^-\\M\\\^^-\. (3.11) 

The concentration parameters /i^ and v that characterize x are defined in (3.1) and (3.2). Lemma 4.2 
provides a simple deterministic estimate for the remaining Schur product: 

||M0a;a;*K||M||||a;||i. (3.12) 

Introduce the matrix variance bound (3.11) and the Schur product bound (3.12) into (3.10) to obtain 



,8elogp II,, II 8elogn 
^.||M||i^2-M4V + ^-IIM 



Emax,- ||a;;||l 



1/2 



(3.13) 



To incorporate the semidefinite second moment bound, we have used the fact that the spectral norm is 
monotone with respect to the order =^ on the set of positive semidefinite matrices. This is the first claim 
in Theorem 3.1. 

To establish the remaining claim (3.4), we need a bound for the expected maximum of ||a;,||f^. 
Lemma 4.3 states that 

Emax, ||a;,||l < inf {npYI' ■ ^i. (3.14) 

r>l 

This observation completes the proof. 



4. Computing the Matrix Variance and the Maximum Spectral Norm 

In this section, we complete the calculations that stand at the center of Theorem 3.1. 
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4.1 A Semidefinite Bound for the Matrix Variance 

First, we study the matrix variance E(M xx*)^. This calculation requires some insight, and it is the 
main novelty in our proof. The key idea is that the monotonicity (2.2) of the Schur product allows us to 
replace one factor in the product by a scalar matrix. This act of diagonalization simplifies the estimate 
tremendously because we erase the off-diagonal entries when we take the Schur product with an identity 
matrix. 

Lemma 4.1 (Matrix Variance Bound) Fix a self-adjoint p x p matrix M . Let x = {X\ ,Xp)* be a 
random vector. Then 

E{MQxx*f ^ ^|v2||M|j2^2-I- 
The concentration parameters 114 and v are defined in (3.1) and (3.2). 

Proof. To begin, we perform some algebraic manipulations to consolidate the randomness. The Schur 
product identity (2.1) implies that 

{MQxx*y^ = (diag(a;)Mdiag(a;))^ 

= diag(a;)(Mdiag(a;)^M)diag(a;) = (Mdiag(a;)^M) a;a;*. 

Rewrite the diagonal matrix as a linear combination of matrix units: diag(a;)^ = L/^,^ E„. The bilinear- 
ity of the Schur product now yields 

(M XX* f = [M (£,.x2e,-,-) M] XX* = £.(ME;,-M) {Xf xx*). 

Take the expectation of this expression to reach 

E{MQxx*f = ^.(ME,-,-M) [E(X2 xx*)]. (4.1) 

Next, we invoke the monotonicity (2.2) of the Schur product to make a diagonal estimate for each 
summand in (4.1); 

(MEiiM) [E{X^xx*)] 4 X^^^iEiX^xx*)) ■ (ME„-M) L 

The Rayleigh-Ritz variational formula [3, Cor. 111.1.2] allows us to write the maximum eigenvalue as a 
supremum. Thus, 

KaAMX^xx*))= sup u*[E{X^xx*)]u= sup E[X,2|«*a;p] 

II u|| = 1 II li|| = 1 

sup (EX4)l/2(]E|u*a;|4)l/2 ^^2^2^ 

||m|| = 1 

The first inequality is Cauchy-Schwarz. The final inequality follows from the definitions (3.1) and (3.2) 
of the concentration parameters. Combine the last two displays to obtain 

(MEiiM) [E{X^xx*)] 4 ■ (MEiiM) 01. (4.2) 

To complete our bound for the variance, we introduce (4.2) into (4. 1), which delivers 

E[MQxx*f ^ pilv^ ■ Ql^ pilv- ■ A\3.g{M^) 

We can control a positive-semidefinite diagonal matrix using only its maximum entry: 

E(M0 a;a;*)2 =<; ^2^2 . inaXi(M2)ii • I = ^^v^ ||M||2^2 ' I 

The second relation follows from the fact that the diagonal entries of M^ list the squared ^2 norms of 
the columns of M, while j|Mj| computes the maximum £2 norm of a column of M. □ 
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4.2 Norm Bound for a Schur Product 

Next, we present a simple norm bound for the Schur product M Qxx*. 

Lemma 4.2 (Norm Bound for a Schur Product) Let M he a p x p self-adjoint matrix, and let a; be a 
vector in M.P. Then 

||M0a;a;*|| s$ ||M|| \\x\\i. 
Proof. The Hadamard product identity (2.1) yields 

||M0a;a;*|| = ||diag(a;)Mdiag(a;)|| ||diag(a;)|| ||M|| ||diag(a;)|| = ||M|| ||a;j|i. 

The inequality follows from the submultiplicativity of the spectral norm. □ 



4.3 The Expected Maximum Entry among the Sample Vectors 

Finally, we develop a basic estimate on the expected maximum entry that appears in any of the sample 
vectors. 

Lemma 4.3 (The Expected Maximum) Consider an i.i.d. sequence {a;,}"^[ of random vectors in W\ 
Then 



Emax; 1 1 a;,- 1 



1/4 



«;inf 



The diagonal moment parameter /i4,. is defined in (3.1). 
Proof. For any r ^ 1, Jensen's inequality yields 





1/4 


Emax; 1 a5/| 1 





|4/- 



l/4r 



n p 



l/4r 



. max,- 1 1 a;/ 1 L ) . ) . E IX,-;!" < (n/?)'/V4r• 

We have written Xjj for the yth entry of a;, and invoked the definition of the diagonal moment 114^. □ 
A. The Matrix Moment Inequality 

This appendix contains a proof of Theorem 3.2, the matrix moment inequality that animates our argu- 
ment. It costs us no additional energy to prove a result that holds for all moments. 

Theorem A.l (Matrix Moment Inequality) Assume that p^3. 

1. Suppose that ^ ^ 1, andfix r ^ max{(7,21og/9}. Consider a finite sequence { Wi } of independent, 
random, positive-semidefinite matrices with dimension px p. Then 



[E 1 1 5^ . f ] 1 1 J^,. E W;- 1 1 ' + 2 (E max; 1 1 W/ 



1/2 



\q\i/2q 



(A.l) 



2. Suppose that q^2, and fix r^ max{q',21og/:>}. Consider a finite sequence { Y,} of independent, 
symmetric, random, self-adjoint matrices with dimension p x p. Then 



|iA/ 



-2er(Emax,i|r,|n'/''. 



(A.2) 
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Theorem 3.2 follows from (A. 2) when we select q = 2 and r = 2 log/?. We establish Theorem A.l 
below after we provide some comments, preliminary results, and historical background. 

Theorem A. 1 shows that the moments of a spectral norm of a sum are controlled by two different 
quantities. The first term in (A.l) is a matrix mean, while the first term in (A. 2) is a matrix variance. 
These terms reflect the size of moderate deviations, and they depend only weakly on the order q of the 
moment. The second term measures the size of the largest summand, and it controls the large deviation 
behavior of the sum. These bounds are related to the matrix Rosenthal inequality [22, 27], and they can 
be viewed as the moment inequality underlying the matrix Bernstein inequality [38, Thm. 1.4]. 



A. 1 Matrix Khintchine Inequality 

The main ingredient in the proof of Theorem A. 1 is the matrix Khintchine inequality. To our knowledge, 
this result is the earliest matrix moment inequality. 

Proposition A.2 (Matrix Khintchine Inequality) Suppose that r ^ 2. Consider a finite sequence {A,} 
of deterministic, self-adjoint matrices. Then 



1/'- 



1/2 



The sequence {^,} consists of independent Rademacher random variables. 

Lust-Picquard established the first version of the matrix Khintchine inequality in [25], with a weaker 
estimate for the constant. Her subsequent paper with Pisier [26] contains important extensions and 
refinements. Buchholz obtained sharp constants for even r in [7]. The recent work [27, Sec. 7] contains 
what may be the easiest proof of Proposition A.2; it yields the near-optimal bound ^ for the constant. 



A.2 Historical Background on Matrix Concentration Inequalities 

Research on matrix moment inequalities contains several strands that date back to the late 1990s. The 
paper [31] of Pisier and Xu initiated the field of noncommutative martingale inequalities. This literature 
contains many powerful moment bounds that can also be used to study sums of independent random 
matrices [19-22]. An early application of this theory appeared in Rudelson's paper [34], which uses the 
matrix Khintchine inequahty to obtain a sample complexity bound for classical covariance estimation. 
Many authors in computer science, mathematical signal processing, and other areas adapted Rudelson's 
method [35, 36, 39]. 

There is a parallel line of work that develops exponential moment inequalities for sums of random 
matrices. This research was initiated in the paper Ahlswede-Winter [1] and continued in a variety of 
other works [14, 27, 30, 32, 37, 38]. Over the last few years, these results have started to see wide 
application. 

As it is stated. Theorem A. 1 seems to be new. Nevertheless, the result is substantially similar to 
some previous matrix concentration inequalities that appear in the literature. Indeed, we have adapted 
the argument from Rudelson's paper [34], the refinements of Rudelson's work in [35, 36], and the recent 
proofs of two matrix Rosenthal inequalities [22, 27]. 



A. 3 Proof of the Matrix Moment Inequality, Part I 

We prove Theorem A.l in two steps. First, we estabhsh the inequality (A.l) for positive matrices. In 
the second stage, we extend this result to obtain the bound (A.2) for general matrices. 
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To begin, we introduce the quantity of interest: 

K/l l/? 



2[E||£.^,W,- 



KniA/ 



(A.3) 



The inequahty follows when we center the sum and apply the standard symmetrization result [23, 
Lem. 6.3]. The sequence {^,} consists of independent Rademacher random variables that are also 
independent from the sequence {P,}. 

Let us focus on the first term on the right-hand side of (A.3): 



ryil< 



1/9 



The inequality holds because the Schatten r-norm dominates the spectral norm, and we have used the 
fact ^ ^ r to apply Jensen's inequality to the inner expectation. An application of the matrix Khintchine 
inequality. Proposition A. 2, delivers the bound 



E 



1/9 



We proceed through a short chain of inequahties to complete the estimate: 



Fq ^ \/er 



^ ver 



:(max,||W,.!r''^-||LW^f'^')l'^' 

9ii/29 riff iiv xM.m^i^i 



^ V^[Emax,|| Will*]'/''' - [E HE,. Wif ] 

In the preceding calculation, we first replace the Schatten r-norm by the spectral norm, which results in a 
loss of at most p'/'^ -/e. Since Wi is positive semidefinite, we can make the bound Wi 
and invoke the monotonicity of the spectral norm on the positive-semidefinite cone to draw off the 
maximum norm achieved by any one of the summands. The last inequality is Cauchy-Schwarz. We 
conclude that 



q^l/lq 



Fq ^ Ve7[Emax/|| W;||''] 

by identifying a copy of Eq. 

To complete the argument, introduce (A.4) into the inequality (A.3): 



(A.4) 



eI ^ 2V^[Emax,- 1| W || + ||E,-IE W;-|| . 
Solutions to the quadratic inequahty !^ax + b satisfy x + \/h. Therefore, 
Eq s$ 2Vi7[Emax, || W,f'] + jj^.E Wi\^'^ . 



This estimate coincides with the bound (A.l). 
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A. 4 Proof of the Matrix Moment Inequality, Part II 



To establish the second bound (A. 2), we apply the matrix Khintchine inequality to obtain a bound 
involving positive-semidefinite random matrices, and then we invoke our first result (A. 1). Indeed, 
since the summands Yi are symmetric random variables. 



[E||£, Y,\\"f" = [E||L,^,-F,|rr'' ^ ^y. i^e, WL.^iY.l 



'11 1/9 



1/9 



The inequality follows from the same considerations as in the proof of (A. 2). Invoke Proposition A. 2 to 
obtain 



|<?1 !/<? 



E 



2n1/2 



1/9 



IE, y. 



2ll'//2 



In the second step, we have replaced the Schatten r-norm with the spectral norm; the third step follows 
from Jensen's inequality. The resulting expression involves a sum of independent, random positive- 
semidefinite matrices. Since q/2 ^ 1, we can apply (A.l) with Wi = Yj^ to reach the conclusion (A.2). 
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